# Deep Learning-Based Multiple Object Visual Tracking on Embedded System for IoT and Mobile Edge Computing Applications

Beatriz Blanco-Filgueira, Daniel García-Lesta, Mauro Fernández-Sanjurjo, Víctor Manuel Brea, and Paula López, *Member, IEEE*

Abstract—Compute and memory demands of state-of-the-art deep learning methods are still a shortcoming that must be addressed to make them useful at Internet of Things (IoT) end-nodes. In particular, recent results depict a hopeful prospect for image processing using convolutional neural networks (CNNs), but the gap between software and hardware implementations is already considerable for IoT and mobile edge computing applications due to their high power consumption. This proposal performs low-power and real time deep learning-based multiple object visual tracking implemented on an NVIDIA Jetson TX2 development kit. The kit includes a camera and wireless connection capability, and it is battery powered for mobile and outdoor applications. A collection of representative sequences captured with the on-board camera, the dETRUSC video dataset, is used to exemplify the performance of the proposed algorithm and to facilitate benchmarking. The results in terms of power consumption and frame rate demonstrate the feasibility of deep learning algorithms on embedded platforms, although more effort in the joint algorithm and hardware design of CNNs is needed.

Index Terms—Deep learning, edge computing, foreground segmentation, Internet of Things (IoT) node, visual tracking.

## I. INTRODUCTION

Visual tasks, such as object detection, classification, or tracking, are essential for many practical applications at the perception or sensor layer of an Internet of Things (IoT) architecture [1]. These tasks rely on a large quantity of data and require prompt response, that is, the scene must be captured and processed for real time decision making. If real time is defined as live performance, understood as the capability of solving the application problem using at any time only the available frame from the live camera without storing intermediate frames for delayed processing, this requirement cannot be efficiently met by cloud computing due to latency and the prospect of poor coverage. In such cases computation is moved to the edges of the network, that is, the sensor layer of the IoT, giving rise to what is called edge computing [2]. The energy efficiency of such visual sensing nodes is one of the key criteria that guide their design, since interconnected devices of the IoT infrastructure need to be battery-powered in many situations, such as mobile and outdoor applications. Furthermore, they could be self-powered by means of energy harvesting techniques, for instance.

Manuscript received July 31, 2018; revised February 1, 2019; accepted February 18, 2019. Date of publication February 27, 2019; date of current version June 19, 2019. This work was supported in part by the Spanish Government project RTI2018-097088-B-C32 MICINN (FEDER), in part by the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08, and reference competitive group 2017-2020, ED431C 2017/69), and in part by the European Regional Development Fund (ERDF). (Corresponding author: Beatriz Blanco-Filgueira.)

The authors are with the Centro Singular de Investigación en Tecnoloxías da Información, Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain (e-mail: blancofilgueira@edu.xunta.es).

Digital Object Identifier 10.1109/JIOT.2019.2902141

Image processing solutions range from traditional computer vision algorithms to the more recent application of deep learning-based strategies. The potential of the latter has taken image processing to a new level in recent years [3]. Convolutional neural networks, known as CNNs [4], along with recent computational capabilities, offer a new strategy to address computer vision tasks. Among the advantages of CNNs in comparison with traditional computer vision algorithms are their robustness and accuracy. Whereas traditional algorithms are optimized for a concrete goal and particular conditions, CNNs are trained on massive amounts of data to undertake more general challenges. Although the training phase is usually highly demanding in terms of computation, the benefits of the CNN can be exploited using less sophisticated hardware resources once the CNN has been trained.

The results of recent years depict a hopeful prospect for image processing using CNNs. Efforts have been focused on a wide range of applications, such as image classification and object detection [5], segmentation [6], and object tracking [7]. Despite the promising results in most of them, object tracking using deep features and CNNs has only recently emerged. One of the state-of-the-art benchmarks for object tracking is the visual object tracking (VOT) challenge [8], and the winners of recent years were based on both deep learning techniques [9], [10] and deep features [11], [12].

Despite the rapid development of deep learning methods, and of CNNs for computer vision purposes in particular, the gap between software and hardware implementations is already considerable [13]. In order to make state-of-the-art networks useful at IoT end-nodes, more attention must be paid to their power consumption and their compute and memory demands. Most of the energy consumption of CNNs is related to data movement rather than to computation itself [14]. Hence, a great effort is needed not only in network design but also in hardware implementation in order to exploit the potential of CNNs in the computer vision field for low-power real time mobile systems and IoT applications. In fact, some authors have recently highlighted the importance of the joint algorithm and hardware design of neural networks [13].

Algorithms and CNNs are evolving and competing continuously to be more accurate and faster, and they are usually tested with benchmark databases on powerful hardware platforms. On the other hand, manufacturers and scientists try to diversify the available hardware options to provide the resources required to implement and accelerate those demanding networks. However, there is a lack of end-to-end IoT end-nodes performing image capture, processing, and communication. For this purpose, embedded platforms, such as the NVIDIA Jetson TX2, are probably the most suitable choice in terms of design speed and cost effectiveness to develop proofs of concept and even final solutions for a wide range of applications. Moreover, Jetson TX2 is offered in a ready to use development kit but also as a single board computer with a size of  $50 \times 87$  mm and a weight of 85 g. The development kit includes a camera and wireless connection capability, thus development from beginning to end can be accomplished from an early stage [15].

In this paper, low-power and real time deep learning-based multiple object tracking is implemented on an NVIDIA Jetson TX2 development kit. Section II offers a discussion about hardware availability and our choice. Next, the implementation of the GOTURN CNN tracker as a multiple object tracking proposal is described in Section III. A hardware-oriented pixel-based adaptive segmenter (HO-PBAS) algorithm [16] is used to detect moving objects and it is integrated with the GOTURN CNN-based tracker [17]. Moreover, additional code was included to manage multiple object tracking. Section IV shows the performance of the proposed algorithm over a collection of representative sequences captured with the onboard camera, dETRUSC video dataset [18], and the results in terms of power consumption and velocity. Finally, the main conclusions are summarized in Section V.

## II. HARDWARE IMPLEMENTATION

CNNs offer increasing accuracy for visual tasks at the expense of huge computation and memory resources. Current hardware solutions can satisfy these demands at the cost of high energy consumption, especially when real time processing is also a requirement, understood as the capability of solving the application problem using at any time only the available frame provided by the live camera without intermediate storage for delayed processing. That is the case of edge computing solutions, where the computation is performed close to the data source in order to avoid cloud transfer and thus improve response time [2]. Edge computing benefits from the energy saved by avoiding cloud transfer, but it also requires embedded low-power nodes. In order to exploit state-of-the-art CNNs for real applications at IoT end-nodes, low-power embedded solutions must be explored while maintaining reasonable accuracy and performance.

Diverse hardware solutions for CNN inference acceleration have been proposed in recent years. They range from standalone solutions to heterogeneous systems and systems-on-chip (SoC), which can include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), CPUs, and graphics processing units (GPUs).

Focusing on proposals oriented to visual tasks, those based on FPGAs stand out in terms of energy efficiency but not in performance [19], [20]. Additionally, some of them rely on other elements such as CPUs or external DRAM memory [21]–[24], the networks used as benchmarks are not always representative [25], [26], or their price limits the application range [27], [28].

In terms of ASICs, several accelerators for CNNs have been presented during the last year as an alternative to CPUs and GPUs for computer vision tasks such as image classification [29], face detection [30], and recognition [31]. They offer very low power consumption but suffer from long development cycles, high cost, and little flexibility to adapt to the rapid progress of the algorithms.

Regarding CPUs, Intel Knights Landing [32] and Intel Knights Mill [33] generations of Xeon Phi x86 CPUs are optimized for deep learning but conceived for training in supercomputers, servers, and high-end workstations. Intel also offers the Aero Compute Board, a purpose-built unmanned aerial vehicle developer kit powered by a quad-core Intel Atom processor.

NVIDIA, one of the most important GPU manufacturers, has also increased its offering for deep learning solutions rapidly [34]. NVIDIA GPUs for deep learning cover data center solutions (DGX systems and Tesla solutions), desktop development (Titan Xp, Quadro GV100, Titan V), and embedded applications (Jetson TX2). NVIDIA DGX systems are fully integrated solutions built on the NVIDIA Volta GPU platform but server oriented. NVIDIA DGX-2 contains 16 Tesla V100 GPUs consuming 10 kW whereas NVIDIA DGX-1 is based on 8 Tesla V100 GPUs with 3.5 kW consumption. Facebook's Big Sur [35] and more recent Big Basin [36] custom deep learning server, with 8 NVIDIA Tesla P100 GPU accelerators, is similar to NVIDIA DGX-1.

NVIDIA Jetson TX2 is a promising AI SoC powered by the NVIDIA Pascal GPU architecture for inference at the edge. It is power-efficient, with small dimensions and high throughput for embedded applications. Whereas the power consumption of discrete GPUs ranges between 150 and 250 W, that of an integrated GPU such as the one on the Jetson TX2 lies between 5 and 15 W. Additional advantages are the reduced size and the absence of any need for active cooling. It is also commercially available as a development kit [15], which includes a 5MP CSI camera and WLAN and Bluetooth connectivity among other peripherals. Since the Jetson TX2 development kit fits all requirements, including a reasonable cost, it was chosen for the scope of this paper.

Jetson TX2 consists of a quad-core 2.0 GHz 64 bit ARMv8 A57 processor, a dual-core 2.0 GHz superscalar ARMv8 Denver processor, and an integrated 1.3 GHz Pascal GPU with 256 cores. The six CPU cores and the GPU share 8 GB DRAM memory. Jetson TX2 includes a command line tool for switching operation modes at run time, adjusting the CPU and GPU clock speeds by dynamic voltage and frequency scaling (DVFS), see Table I. Max-Q mode represents the peak of the power/throughput curve, that is, the peak efficiency, which corresponds to 7.5 W consumption and limits the clocks to ensure operation in the most efficient range only. Max-P enables maximum system performance although with higher consumption (15 W maximum) and reduced efficiency. Custom configurations with intermediate frequencies are also allowed for the purpose of balancing between peak efficiency and peak performance. Finally, DVFS can be disabled to run all cores at maximum speed at all times by activating full clocks mode while operating in any mode except Max-Q.

TABLE I. NVIDIA JETSON TX2 OPERATION MODES

| Mode           | Denver (GHz) | A57 (GHz) | GPU (GHz) |
|----------------|--------------|-----------|-----------|
| Max-N          | 2.0          | 2.0       | 1.30      |
| Max-Q          | -            | 1.2       | 0.85      |
| Max-P Core-All | 1.4          | 1.4       | 1.12      |
| Max-P ARM      | -            | 2.0       | 1.12      |
| Max-P Denver   | 2.0          | -         | 1.12      |

Fig. 1. Embedded solution: Jetson TX2 development kit is battery powered and remotely controlled using a tablet and WiFi connection.

A picture of our experimental set-up during execution is shown in Fig. 1. Jetson TX2 development kit is powered by a 3S LiPo battery and remotely controlled by a tablet using WiFi connection. The images shown on the tablet correspond to live performance, where the black and white image on the left represents the detection of the foreground and the camera capture with green and red boxes on the right depicts the tracking. More details can be found in Section IV-A.

## III. MULTIPLE OBJECT TRACKING APPROACH

### A. Overview

A proposal for multiple object tracking was developed making use of the HO-PBAS algorithm [16] to detect moving objects, integrated with the GOTURN CNN-based tracker [17]. The HO-PBAS detector [16] is a hardware-oriented foreground segmentation method based on the pixel-based adaptive segmenter (PBAS) algorithm [37]. The HO-PBAS is oriented to focal-plane processing [38], with the benefit of lower memory usage, which makes it appropriate for a future on-chip implementation of the whole algorithm and also for use on embedded solutions, such as the Jetson platform. Both CPU- and GPU-based implementations of the HO-PBAS were developed. The latter was written with the CUDA version of the OpenCV library functions, using some custom-made low-level CUDA functions as well. However, due to the simplicity of the algorithm, the speed-up obtained from the GPU parallel execution does not compensate for the time required for the data transfer from the global memory to the GPU memory. Thus, the CPU version of the HO-PBAS was chosen, leaving the GPU computational resources available for the tracking algorithm. On the other hand, GOTURN [17], which was originally designed for single-object tracking, has been applied in this paper to multiple object tracking. It was tested on the VOT2014 dataset [39] and it is one of the few trackers that can achieve real time performance with the additional advantage of reasonable hardware requirements. The GOTURN algorithm was implemented using Caffe [40], a deep learning framework which allows a high level of abstraction. Caffe libraries are optimized to use CUDA, so low-level development is not needed to maximize performance.

Both algorithms have been tested on publicly available datasets and present good results separately. In this paper, they were integrated to provide an end-to-end solution on the NVIDIA Jetson TX2 embedded platform described in Section II. The solution consists of multiple object tracking of objects detected in real time. For this purpose, both algorithms must run jointly, obtaining precise object detections from HO-PBAS which are used as inputs for the GOTURN tracker. Regardless of the accuracy of the detector used, its output cannot be as precise as manual annotation of the object to track, which is the approach commonly used for tracker validation, such as in the VOT competition. Thus, mere integration of both algorithms is not enough, and additional high-level development was needed to overcome the limitations of a non-ideal input annotation for the tracker, as will be explained in the next sections.

A diagram of the proposed algorithm is depicted in Fig. 2. The model and the camera are initialized, and then detection and tracking of multiple objects are carried out as explained in the following sections.

### B. Detection

The outputs of the HO-PBAS detector are the bounding boxes of the moving objects in the scene which have an area between desired maximum and minimum values. As new objects are detected in the scene, their bounding boxes are sent to the tracker. Typically, trackers are tested on annotated videos where the first frame bounding box is used as input, that is, as the object to track. Thus, the object is completely within the scene in the first frame and the manual annotation is ideal. This means that the object can be well characterized by the CNN and the tracking algorithms can be compared on the same terms for challenge purposes, regardless of the detection. However, in a real situation objects come into the scene from the edges and they are detected as soon as they occupy a minimum area, which can occur before they enter the image completely. If a partially cropped object is sent to the tracker, the tracker probably lacks enough information to treat it as the same object when it appears completely. Therefore, partial object detection must be prevented by only considering completely detected objects. In our implementation, this is accomplished by requiring the bounding boxes to be separated from the image borders by a certain number of pixels.
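The two acceptance rules described above (an area window and a minimum distance from the image borders) can be sketched in a few lines. This is only an illustrative Python sketch, not the authors' C++ code: the function names, the `(x, y, width, height)` box layout, and the numeric defaults for `min_area`, `max_area`, and `margin` are assumptions, since the paper does not report the values used.

```python
from typing import List, Tuple

# Hypothetical box layout: (x, y, width, height) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def is_complete_detection(box: Box, frame_w: int, frame_h: int,
                          min_area: int, max_area: int, margin: int) -> bool:
    """Accept a box only if its area is admissible and it is clear of the borders."""
    x, y, w, h = box
    area = w * h
    if not (min_area <= area <= max_area):
        return False
    # The box must keep at least `margin` pixels from every image border.
    return (x >= margin and y >= margin
            and x + w <= frame_w - margin
            and y + h <= frame_h - margin)


def filter_detections(boxes: List[Box], frame_w: int = 320, frame_h: int = 240,
                      min_area: int = 200, max_area: int = 20000,
                      margin: int = 5) -> List[Box]:
    """Keep only detections that may be handed to the tracker (assumed thresholds)."""
    return [b for b in boxes
            if is_complete_detection(b, frame_w, frame_h, min_area, max_area, margin)]


if __name__ == "__main__":
    detections = [(0, 50, 40, 60),    # touches the left border: still entering
                  (100, 80, 40, 60),  # fully inside the QVGA frame
                  (150, 100, 5, 5)]   # too small
    print(filter_detections(detections))
```

With the assumed thresholds, only the second box passes: the first one touches the left border and the third one is below the minimum area.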

### C. Tracking of Previous Objects and New Candidates

- 1) After the detection, if moving objects were already detected in previous frames ("previous objects?" Fig. 2), the GOTURN algorithm continues tracking them independently of HO-PBAS performance. To do that, the CNN itself estimates the new position of every tracked object. Additionally, in order to stop the tracking of those that leave the scene, the algorithm checks if any of the bounding boxes are close to the edges of the image ["track (GOTURN) and stop (if needed)," Fig. 2].
- 2) Then, it must be checked whether there is a new object in the scene. The minimum matching between all pairs of previous and current bounding boxes, calculated as the intersection over union [41], is selected as a potential new detection ("match previous and current detections," Fig. 2). In the particular case that more than one new object appeared on the scene, they would be identified and their tracking initialized in subsequent frames.
- 3) After that, it must be determined whether the new detection is really a new object ("new object?," Fig. 2). This is solved by comparing the new detection with all the bounding boxes previously estimated by the tracker. This prevents false positives derived, for instance, from the following circumstances.
  - a) The occasional detector failure over the same moving object which would result in a new object every time the detector fails and restores its performance later.
  - b) Objects that stop for a while, long enough to be treated as background, and are thus identified as new objects by the detector when they move again.
  - c) Objects that move away from the foreground can become too small to be detected by HO-PBAS but they can be tracked by GOTURN and be recognized as previous objects if they come back to the foreground, see Fig. 3(a).

Next, if the new detection is identified as a new object after the checking, a new tracker is initialized ("init track," Fig. 2). Finally, current detections are saved as previous detections for the next frame ("save detections," Fig. 2).
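The matching steps above can be illustrated with a small Python sketch. It is one possible reading of the procedure rather than the authors' implementation: the candidate is taken to be the current detection whose best intersection-over-union (IoU) match with the previous detections is lowest, and it is accepted as new only if it overlaps none of the boxes estimated by the tracker. The function names and the `iou_threshold` value are hypothetical, as the paper does not give them.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def new_object_candidate(previous, current):
    """Return the current detection that matches the previous detections worst.

    Assumed reading of step 2: one candidate per frame, so several
    simultaneous newcomers are picked up over subsequent frames.
    """
    if not current:
        return None

    def best_match(box):
        return max((iou(box, p) for p in previous), default=0.0)

    return min(current, key=best_match)


def is_new_object(candidate, tracked, iou_threshold=0.3):
    """Step 3: the candidate is new only if no tracked box overlaps it enough.

    `iou_threshold` is an illustrative value, not one taken from the paper.
    """
    return all(iou(candidate, t) < iou_threshold for t in tracked)


if __name__ == "__main__":
    previous = [(10, 10, 40, 40)]
    current = [(12, 10, 40, 40), (200, 100, 30, 30)]
    tracked = [(12, 11, 40, 40)]  # box estimated by the tracker for the old object
    candidate = new_object_candidate(previous, current)
    print(candidate, is_new_object(candidate, tracked))
```

Comparing the candidate against the tracker estimates, rather than against the detector output alone, is what covers the three circumstances listed above: an object that the detector loses temporarily is still represented by its tracked box.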



Fig. 2. Conceptual diagram of the multiple object tracking algorithm proposed in this paper.

Although there are more robust state-of-the-art techniques to solve the problem of data association between trackers and detections, such as those well ranked at the MOTChallenge benchmark [42], sometimes their low speed is a drawback for real time applications [43], [44]. Owing to the requirement to develop an end-to-end solution on a low-power embedded platform with real time performance, the more straightforward solution described here is used instead.

## IV. RESULTS

The proposed algorithm was written in C++. It processes QVGA resolution frames ( $320 \times 240$  pixels) and its maximum RAM usage is 1.6 GB. Following recent trends and the suggestions of other authors, who call for joint software and hardware design, the results of the present work focus on the frame rate and power consumption of the algorithm, which are hardware dependent [13], [45].

### A. dETRUSC Database

The accuracy of a model should be measured on widely used datasets. The HO-PBAS detector was evaluated on the 2014 updated version of the changedetection.net public dataset [46], whereas the GOTURN tracker was tested on the VOT2014 dataset [39]. However, in this paper the detection and tracking algorithms were integrated, and no benchmark for their joint performance is available. Additionally, videos are captured with the camera of the Jetson TX2 development kit in order to demonstrate live processing capability for real time decision making. That is, the complete end-node performance aims to demonstrate the feasibility of deep learning techniques for visual tasks on embedded solutions for IoT end-nodes.

For the aforementioned reasons, a collection of representative sequences was captured with the on-board camera and used to exemplify the performance of the proposed algorithm. The dETRUSC video dataset and the results are publicly available and can be accessed at [18]. The videos represent real scenarios, including diverse challenging situations, such as low light and high contrast conditions, glass reflections, high velocity, and shadows. Fig. 3 shows a pair of captures of the multiple object tracking performance. The black and white image on the left is the HO-PBAS segmentation, whereas tracking is depicted on the right. Fig. 3(a) exemplifies the correct tracker performance under low light intensity and high contrast conditions. In Fig. 3(b), the capability of the tracker to follow vehicles is illustrated, even though GOTURN is not designed to handle large movements, a choice that keeps the complexity of the network low and thus allows a high frame rate.

### B. Detection and Tracking Performance

The separate performance of the HO-PBAS detector and the GOTURN tracker has been previously reported in [16] and [17], respectively. Regarding robustness and precision, it must be considered that they are usually calculated for single object tracking. Additionally, the ground truth detection is used to initialize the algorithm and to reinitialize it when precision decreases under a certain threshold. This differs from our joint detection and tracking solution, which initializes the GOTURN tracker with the detection given by the HO-PBAS and never reinitializes it.

Considering the aforementioned aspects, precision and success plots of the proposed complete system using some videos of the dETRUSC dataset, Section IV-A, are depicted in Fig. 4 for completeness, but no comparison with other trackers in challenges like VOT is possible on the same terms, as we do not use the ground truth bounding box to initialize and reinitialize the tracker. Fig. 4(a) shows the success plots for three videos. The success plot represents the percentage of successful frames, that is, frames whose overlap between the tracker and the ground truth bounding boxes is larger than the given threshold [41]. The arithmetic mean of the success rate at an overlap threshold of 0.5 is 90% in our complete system. Fig. 4(b) represents the precision plots for the same videos. The precision plot is based on the center location error metric, which is defined as the average Euclidean distance between the center locations of the tracked target and the ground truth [41]. Thus, the precision plot shows the percentage of frames whose estimated location is within the given threshold distance of the ground truth. A representative precision score is the one at a threshold of 20 pixels, which is 100% in our complete system.

Fig. 3. Example of multiple object tracking under different challenging conditions. (a) Low light intensity and high contrast conditions. (b) High velocity.
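To make the success and precision scores defined above concrete, the following Python sketch computes one point of each curve from per-frame boxes. It assumes `(x, y, width, height)` boxes and one tracked box per ground truth frame; the function names are hypothetical and the sample boxes are made up for illustration, not taken from the dETRUSC dataset.

```python
import math


def overlap(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance in pixels between the centers of two boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def success_rate(tracked, truth, overlap_threshold=0.5):
    """Fraction of frames whose overlap is larger than the threshold."""
    if not truth:
        raise ValueError("at least one ground truth frame is required")
    hits = sum(1 for t, g in zip(tracked, truth) if overlap(t, g) > overlap_threshold)
    return hits / len(truth)


def precision(tracked, truth, distance_threshold=20.0):
    """Fraction of frames whose center error is within the threshold distance."""
    if not truth:
        raise ValueError("at least one ground truth frame is required")
    hits = sum(1 for t, g in zip(tracked, truth)
               if center_error(t, g) <= distance_threshold)
    return hits / len(truth)


if __name__ == "__main__":
    # Made-up three-frame sequence: the tracker drifts away in the last frame.
    truth = [(10, 10, 40, 40), (20, 10, 40, 40), (30, 10, 40, 40)]
    tracked = [(10, 10, 40, 40), (25, 10, 40, 40), (80, 10, 40, 40)]
    print(success_rate(tracked, truth), precision(tracked, truth))
```

Sweeping `overlap_threshold` from 0 to 1, or `distance_threshold` over a range of pixel distances, yields the full success and precision curves of the kind shown in Fig. 4.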

### C. Frame Rate

The VOT2017 challenge introduced a real time experiment [47] but, as explained in the report, tracking speed depends not only on the programming effort and skill but also on the hardware used. If neither a power constraint nor a small size is required, latency decreases as more powerful hardware is chosen. Thus, a comparison of different algorithms in terms of throughput (GOPS), latency (s), or frame rate (f/s) is meaningful only when they all run on the same hardware.

To the best of our knowledge, there are no previous solutions for multiple object detection and tracking running on Jetson TX2, thus this proposal cannot be compared with others in terms of velocity. Only one recent work [48] characterizes the latency and throughput of different CNNs, and only for object recognition or detection, achieving a maximum frame rate of 4.3 f/s in full clocks mode for object detection, which is not enough for real time inference at the edge.

Fig. 4. Success and precision plots for some videos of the dETRUSC dataset and arithmetic mean. (a) Success plots. (b) Precision plots.

Fig. 5. Frame rate of the multiple object tracking algorithm for different number of tracked objects and Jetson TX2 operation modes. (a) Normal operation. (b) Full clocks mode.

In order to measure the maximum algorithm speed on the Jetson, regardless of the capture frame rate, prerecorded video is processed instead of live capture. The frame rate of the proposed model as a function of the number of tracked objects was measured for all the available operation modes and for full clocks mode, Fig. 5. As the number of objects increases, the algorithm processes the frames more slowly, although a saturation tendency is observed. The most efficient operation mode, Max-Q, also offers the worst performance in terms of velocity. Full clocks mode improves the performance for single object tracking, but no significant gain is observed for multiple object tracking.

In view of these results, real time performance was evaluated at Max-N operation mode and 10 f/s video capture, using at any time only the available frame from the live camera without storing intermediate frames for delayed processing. Satisfactory results were found in outdoor and indoor scenarios, under low light intensity and high contrast conditions and fast moving objects, such as vehicles. Video results can be accessed at the dETRUSC database site [18].
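The live-performance constraint used here (only the most recent frame is ever processed, and nothing is queued for later) can be illustrated with a single-slot buffer shared between a capture thread and a processing loop. The paper does not describe how its capture loop is organized, so the following Python sketch, including the `LatestFrameSlot` class and its `dropped` counter, is purely hypothetical.

```python
import threading


class LatestFrameSlot:
    """Single-slot buffer: a new frame overwrites any frame not yet processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None
        self.dropped = 0  # frames overwritten before the processing loop took them

    def put(self, frame):
        """Called by the capture side for every frame delivered by the camera."""
        with self._lock:
            if self._frame is not None:
                self.dropped += 1
            self._frame = frame

    def take(self):
        """Called by the processing side; returns the newest frame or None."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame


if __name__ == "__main__":
    slot = LatestFrameSlot()
    # Simulate a camera that delivers three frames while one is being processed.
    for name in ("frame0", "frame1", "frame2"):
        slot.put(name)
    print(slot.take(), slot.dropped)
```

Because the slot holds at most one frame, processing that is slower than the capture rate results in skipped frames rather than in growing latency, which matches the definition of real time adopted in this paper.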

### D. Power Consumption

Consumption is one of the main limiting factors for deep learning application on embedded IoT end-nodes. The opportunities that CNNs offer for image processing are undeniable but their practical use and expansion will highly depend on the hardware design and the development of hardware-oriented algorithms [45]. The energy consumption of CNNs is dominated by data movement instead of the computation itself [13]. Fortunately, the most costly operations in terms of data movement are highly parallel.

The total power consumption of the proposed multiple object tracking algorithm running on the NVIDIA Jetson TX2 development kit is shown in Fig. 6. It was measured for the different operation modes of the board summarized in Table I. The increments of the bars represent the additional consumption due to the activation of full clocks mode, which forces all cores to run at maximum speed at all times. As can be observed, full clocks mode has little effect in Max-Q mode, since that mode limits the clocks to ensure operation in the most efficient range. The power was also measured as a function of the number of tracked objects, finding that power consumption increases slightly with the presence of more moving objects in the scene. An analogous representation of the power consumed by the CPUs  $(P_{\mathrm{CPU}})$  and GPU  $(P_{\mathrm{GPU}})$  is depicted in Fig. 7. An increase in GPU utilization is observed through a greater power consumption as the number of tracked objects increases. On the contrary, CPU consumption diminishes as the GPU workload rises, because the CPUs must wait for the GPU to finish.

Fig. 6. Total power consumption of the system for different number of tracked objects and Jetson TX2 operation modes. The bar increment represents the additional power consumption in full clocks mode.

Fig. 7. Power consumption of (a) CPUs and (b) GPU for different number of tracked objects and Jetson TX2 operation modes.

As expected, the Max-Q operation mode is the most cost-effective in terms of energy, regardless of metric, ranging from 4.57 to 6.21 W. On the contrary, Max-N is the most expensive with 12 W maximum consumption, since all CPUs and the GPU run at maximum clock speeds, see Table I. The Max-P modes are a balance between both, running all CPUs, only the ARM A57 processors, or only the Denver processors. That is to say, using a 3S 6000 mAh LiPo battery as in our experimental set-up, an autonomy of 6 h under continuous performance at maximum power consumption is guaranteed.
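The autonomy figure can be checked with a back-of-the-envelope estimate: battery energy (cells × cell voltage × capacity) divided by the power drawn. The Python sketch below makes two assumptions that are not stated in the paper: a nominal LiPo cell voltage of 3.7 V (4.2 V when fully charged), and an ideal supply with no conversion losses and the full capacity usable. The function name is hypothetical.

```python
def autonomy_hours(cells, capacity_mah, power_w, cell_voltage=3.7):
    """Idealized battery life in hours for a constant power draw.

    Assumes the whole rated capacity is usable and ignores regulator losses;
    cell_voltage=3.7 V is the usual nominal LiPo value, not a measured one.
    """
    energy_wh = cells * cell_voltage * capacity_mah / 1000.0
    return energy_wh / power_w


if __name__ == "__main__":
    # 3S 6000 mAh battery at the 12 W maximum measured in Max-N mode.
    print(round(autonomy_hours(3, 6000, 12.0), 2))                    # nominal voltage
    print(round(autonomy_hours(3, 6000, 12.0, cell_voltage=4.2), 2))  # full charge
    # Same battery at the 6.21 W maximum measured in Max-Q mode.
    print(round(autonomy_hours(3, 6000, 6.21), 2))
```

Under these assumptions the estimate is about 5.5 h at nominal voltage and about 6.3 h at full-charge voltage for the 12 W case, which brackets the 6 h quoted above, and it roughly doubles in Max-Q mode.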

If we compare our proposal with state-of-the-art detectors and trackers, we can observe that improving the accuracy has a great impact on velocity and power consumption. For instance, the Faster R-CNN [49] and R-FCN [50] detectors can run at 5–17 f/s and 6 f/s, respectively, on a K40 GPU, which has a power consumption of 235 W. That is, a frame rate similar to ours is achieved, but performing only the detection and with 20 times more power consumption. In turn, the MDNet [9] tracker achieves a speed of 1 f/s running on a K20 GPU with 225 W power consumption, whereas a high frame rate of 58–86 f/s is obtained with the SiameseFC [10] tracker but using a GeForce GTX Titan X, which consumes 250 W. Moreover, those data correspond to single-object tracking only and do not consider the detection phase. Additionally, for these detectors and trackers we are not considering the CPU consumption, nor comparing the dimensions and price of the hardware, nor the memory usage, all of which also count against them. In summary, what we propose is an algorithm that solves the application problem in real time, with low memory usage, implemented on a small and affordable hardware platform, and with a low power consumption of 12 W.

## V. CONCLUSION

An end-to-end solution for real time deep learning-based multiple object tracking on an embedded, low-power, IoT-oriented platform is presented. The NVIDIA Jetson TX2 [15] development kit was chosen because it allows development from beginning to end from an early stage, including camera and wireless connection. It was powered by a LiPo battery and remotely controlled using a tablet. Other strengths of the platform are its price and dimensions.

The HO-PBAS algorithm [16] to detect moving objects was integrated with the GOTURN CNN-based tracker [17] in order to perform multiple object tracking. Whereas both algorithms present good results separately, additional development was needed to overcome some limitations arising from their joint operation.

Qualitative results over the dETRUSC video dataset captured with the on-board camera can be found at [18]. Good results were found under real challenging scenarios, including low light and high contrast conditions, glass reflections, high velocity, and shadows. Moreover, the good performance of the proposed algorithm was also exemplified in terms of robustness and precision over videos of the same dataset. Regarding velocity and power consumption, real time performance at 10 f/s video capture, using at any time only the available frame from the live camera without intermediate storage for delayed processing, with a total power consumption of only 12 W is achieved.

The memory and computational resources are sufficient to run the complete solution as on a standard platform, with lower power consumption at the expense of a reduced frame rate. However, if more complexity were needed or more vision algorithms were added, the embedded system could become a limiting factor. Thus, joint algorithm and hardware design is essential for the future of IoT, especially for applications such as visual tasks, which require live performance and for which cloud computing is not an alternative due to latency or poor coverage.

## References

- [1] M. Rusci, D. Rossi, E. Farella, and L. Benini, "A sub-mW IoT-endnode for always-on visual monitoring and smart triggering," *IEEE Internet Things J.*, vol. 4, no. 5, pp. 1284–1295, Oct. 2017.
- [2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," *IEEE Internet Things J.*, vol. 3, no. 5, pp. 637–646, Oct. 2016.

- [3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," *Nature*, vol. 521, pp. 436–444, May 2015.
- [4] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," *Proc. IEEE*, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2016, pp. 770–778.
- [6] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 39, no. 4, pp. 640–651, Apr. 2017.
- [7] E. Gundogdu and A. A. Alatan, "Good features to correlate for visual tracking," *IEEE Trans. Image Process.*, vol. 27, no. 5, pp. 2526–2540, May 2018.
- [8] VOT Challenges. Accessed: Jul. 31, 2018. [Online]. Available: http://www.votchallenge.net/challenges.html
- [9] H. Nam and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2016, pp. 4293–4302.
- [10] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, "Fully convolutional Siamese networks for object tracking," in *Proc. Eur. Conf. Comput. Vis. (ECCV)*, 2016, pp. 850–865.
- [11] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in *Proc. Eur. Conf. Comput. Vis. (ECCV)*, 2016, pp. 472–488.
- [12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: Efficient convolution operators for tracking," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2017, pp. 6638–6646.
- [13] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, "Efficient processing of deep neural networks: A tutorial and survey," *Proc. IEEE*, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
- [14] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, "GPUs and the future of parallel computing," *IEEE Micro*, vol. 31, no. 5, pp. 7–17, Sep./Oct. 2011.
- [15] NVIDIA Jetson. The Embedded Platform for Autonomous Everything. Accessed: Jul. 31, 2018. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/
- [16] D. García-Lesta, P. López, V. M. Brea, and D. Cabello, "In-pixel analog memories for a pixel-based background subtraction algorithm on CMOS vision sensors," *Int. J. Circuit Theory Appl.*, vol. 46, no. 9, pp. 1631–1647, 2018.
- [17] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 FPS with deep regression networks," in *Proc. Eur. Conf. Comput. Vis. (ECCV)*, 2016, pp. 749–765.
- [18] dETRUSC Video Dataset. Accessed: Jul. 31, 2018. [Online]. Available: https://citius.usc.es/investigacion/datasets/detrusc
- [19] E. Nurvitadhi et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?" in *Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA)*, 2017, pp. 5–14.
- [20] K. Guo et al., "Angel-eye: A complete design flow for mapping CNN onto embedded FPGA," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 37, no. 1, pp. 35–47, Jan. 2018.
- [21] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2014, pp. 682–687.
- [22] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in *Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA)*, 2016, pp. 26–35.
- [23] N. Shah, P. Chaudhari, and K. Varghese, "Runtime programmable and memory bandwidth optimized FPGA-based coprocessor for deep convolutional neural network," *IEEE Trans. Neural Netw. Learn. Syst.*, vol. 29, no. 12, pp. 5922–5934, Dec. 2018.
- [24] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 26, no. 7, pp. 1354–1367, Jul. 2018.
- [25] C. Wang et al., "DLAU: A scalable deep learning accelerator unit on FPGA," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 36, no. 3, pp. 513–517, Mar. 2017.
- [26] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in *Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays*, 2015, pp. 161–170.
- [27] Y. Qiao et al., "FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency," *Concurrency Comput. Pract. Exp.*, vol. 29, no. 20, 2017, Art. no. e3850.

- [28] M. Bettoni, G. Urgese, Y. Kobayashi, E. Macii, and A. Acquaviva, "A convolutional neural network fully implemented on FPGA for embedded platforms," in *Proc. New Gener. CAS (NGCAS)*, 2017, pp. 49–52.
- [29] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," *IEEE J. Solid-State Circuits*, vol. 52, no. 1, pp. 127–138, Jan. 2017.
- [30] S. Bang et al., "14.7 A 288μW programmable deep-learning processor with 270KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, 2017, pp. 250–251.
- [31] Z. Du et al., "An accelerator for high efficient vision processing," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 36, no. 2, pp. 227–240, Feb. 2017.
- [32] Intel Xeon Phi Processor 7290F. Accessed: Jul. 31, 2018. [Online]. Available: https://ark.intel.com/products/95831/Intel-Xeon-Phi-Processor-7290F-16GB-1\_50-GHz-72-core
- [33] Intel Xeon Phi Processor 7295. Accessed: Jul. 31, 2018. [Online]. Available: https://ark.intel.com/products/128690/Intel-Xeon-Phi-Processor-7295-16GB-1\_5-GHz-72-Core
- [34] Deep Learning AI Developer. Accessed: Jul. 31, 2018. [Online]. Available: https://www.nvidia.com/en-us/deep-learning-ai/developer/
- [35] Facebook to Open-Source AI Hardware Design. Accessed: Jul. 31, 2018. [Online]. Available: https://code.facebook.com/posts/1687861518126048/facebook-to-open-source-ai-hardware-design/
- [36] Introducing Big Basin: Our Next-Generation AI Hardware. Accessed: Jul. 31, 2018. [Online]. Available: https://code.facebook.com/posts/1835166200089399/introducing-big-basin-our-next-generation-ai-hardware/
- [37] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background segmentation with feedback: The pixel-based adaptive segmenter," in *Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW)*, 2012, pp. 38–43.
- [38] M. Suárez et al., "Low-power CMOS vision sensor for Gaussian pyramid extraction," *IEEE J. Solid-State Circuits*, vol. 52, no. 2, pp. 483–495, Feb. 2017.
- [39] M. Kristan et al., "The visual object tracking VOT2014 challenge results," in *Proc. Eur. Conf. Comput. Vis. (ECCV)*, 2014, pp. 191–217.
- [40] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in *Proc. ACM Multimedia Syst. Conf.*, 2014, pp. 675–678.
- [41] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2013, pp. 2411–2418.
- [42] The Multiple Object Tracking Challenge. Accessed: Jul. 31, 2018. [Online]. Available: https://motchallenge.net/
- [43] R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn, "Fusion of head and full-body detectors for multi-object tracking," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)*, 2017, pp. 1541–1550.
- [44] M. Keuper et al. (2016). A Multi-Cut Formulation for Joint Segmentation and Tracking of Multiple Objects. [Online]. Available: https://arxiv.org/abs/1607.06317
- [45] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, "Hardware for machine learning: Challenges and opportunities," in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, 2017, pp. 1–9.
- [46] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, "Changedetection.net: A new change detection benchmark dataset," in *Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW)*, 2012, pp. 1–8.
- [47] M. Kristan et al., "The visual object tracking VOT2017 challenge results," in *Proc. Int. Conf. Comput. Vis. (ICCV)*, 2017, pp. 1949–1972.
- [48] J. Hanhirova et al., "Latency and throughput characterization of convolutional neural networks for mobile computer vision," in *Proc. ACM Multimedia Syst. Conf.*, 2018, pp. 204–215.
- [49] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
- [50] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in *Proc. Adv. Neural Inf. Process. Syst.*, 2016, pp. 379–387.



**Beatriz Blanco-Filgueira** received the Ph.D. degree from the University of Santiago de Compostela, Santiago de Compostela, Spain, in 2012.

She is currently a Post-Doctoral Researcher with the Centro de Investigación en Tecnoloxías da Información, University of Santiago de Compostela. Her current research interests include physical modeling of electronic devices and deep learning application to image processing.



**Daniel García-Lesta** received the bachelor's degree in physics from the University of Santiago de Compostela, Santiago de Compostela, Spain, in 2014, and the master's degree in electronic systems for information and communication from the National Distance Education University, Madrid, Spain, in 2016. He is currently pursuing the Ph.D. degree at the University of Santiago de Compostela.

His current research interests include microelectronic design, image and video processing, and wireless sensor networks.



**Mauro Fernández-Sanjurjo** received the bachelor's degree in computer engineering and the master's degree in high performance computing from the University of Santiago de Compostela, Santiago de Compostela, Spain, in 2014 and 2015, respectively, where he is currently pursuing the Ph.D. degree.

His current research interests include computer vision, robotics, and artificial intelligence.



**Víctor Manuel Brea** received the Ph.D. degree in physics from the University of Santiago de Compostela, Santiago de Compostela, Spain, in 2003.

He is currently an Associate Professor with the Centro de Investigación en Tecnoloxías da Información, University of Santiago de Compostela. His main research interests lie in the design of efficient architectures and CMOS solutions for computer vision and micro energy harvesting for Internet of Things applications.



**Paula López** (M'02) received the Ph.D. degree in physics from the University of Santiago de Compostela, Santiago de Compostela, Spain, in 2003.

She is currently an Associate Professor with the University of Santiago de Compostela. She has authored or co-authored over 50 research papers. Her current research interests include design of mixed signal integrated circuits for image processing applications and physical modeling of electronic devices, particularly CMOS image sensors, as well as the translation of these models into hardware description languages.